Project 2


2: Get the Data


Task #2a: Load the Telco Customer Churn dataset into a DataFrame.

Task #2b: Confirm the import by executing the head method on your DataFrame.

Task #2c: Determine the ‘shape’ of your DataFrame. Then list the fields with their datatypes using the info method.

In [1]:
# 2a

import pandas as pd

churn = pd.read_csv('churn.csv', delimiter=',', thousands=',')  # delimiter=',' is the default; thousands=',' strips thousands separators during numeric parsing

Install Plotly: pip install plotly==4.3.0

Install imbalanced-learn (for SMOTE): pip install -U imbalanced-learn

*From the Anaconda interface: open Jupyter Notebook --> top-right corner, click New --> Terminal --> run the pip commands above

In [2]:
# 2b
churn.head()
Out[2]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No ... No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes ... Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes ... No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes ... Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No ... No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes

5 rows × 21 columns

In [3]:
# 2c
print(churn.shape)
churn.info() #checking number of null values
(7043, 21)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
customerID          7043 non-null object
gender              7043 non-null object
SeniorCitizen       7043 non-null int64
Partner             7043 non-null object
Dependents          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
In [4]:
churn.dropna(inplace = True) # eliminate null values

churn.info() # same 7043 rows as before: nothing was dropped because pandas sees no NaN values (the blanks are plain strings)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 21 columns):
customerID          7043 non-null object
gender              7043 non-null object
SeniorCitizen       7043 non-null int64
Partner             7043 non-null object
Dependents          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.2+ MB

3: Clean the Data


Task #3a: For customers with a missing Churn status (neither Yes, or No), fill-in the values to be ‘No’.

In [5]:
#3a
churn.Churn.value_counts() # 5174 + 1869 = 7043, so every row already has a Yes/No Churn value
Out[5]:
No     5174
Yes    1869
Name: Churn, dtype: int64
In [6]:
churn.Churn = churn.Churn.apply(lambda x: 'Yes' if x == 'Yes' else 'No') # recode anything that is not 'Yes' to 'No'
churn.Churn.value_counts() # counts unchanged, confirming there were no missing Churn values to fill
Out[6]:
No     5174
Yes    1869
Name: Churn, dtype: int64
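For reference, had the missing Churn values been true NaN rather than already-coded strings, `fillna` would be the direct tool. A minimal sketch on toy data (not the churn file):

```python
import numpy as np
import pandas as pd

# Toy Series standing in for a Churn column with genuine NaN gaps
status = pd.Series(['Yes', np.nan, 'No', np.nan])

filled = status.fillna('No')  # fill only the missing entries with 'No'
print(filled.tolist())  # ['Yes', 'No', 'No', 'No']
```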

Task #3b: Drop any rows with more than one missing value.

In [7]:
#3b drop rows with missing values (none are present, so nothing changes)



churn.dropna(inplace = True)
In [8]:
churn.customerID.value_counts() #still no change
Out[8]:
3005-NFMTA    1
7823-JSOAG    1
9544-PYPSJ    1
6896-SRVYQ    1
5999-LCXAO    1
             ..
8639-NHQEI    1
3769-MHZNV    1
5840-NVDCG    1
4343-EJVQB    1
0020-JDNXP    1
Name: customerID, Length: 7043, dtype: int64
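Strictly, Task #3b asks to drop rows with more than one missing value, while plain `dropna()` drops rows with any. The `thresh` parameter expresses the intended rule directly. A sketch on toy data (not the churn file):

```python
import numpy as np
import pandas as pd

# Toy frame: row 0 is complete, row 1 has one NaN, row 2 has two NaNs
toy = pd.DataFrame({'a': [1, 2, np.nan],
                    'b': [4, np.nan, np.nan],
                    'c': [7, 8, 9]})

# Keep rows with at least (ncols - 1) non-null values,
# i.e. drop only rows with MORE than one missing value
kept = toy.dropna(thresh=toy.shape[1] - 1)
print(kept.index.tolist())  # [0, 1] -- only the two-NaN row is dropped
```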

Task #3c: Eliminate any duplicate records.

In [9]:
#3c no duplicate customer IDs, but for good measure, showing any duplicates:

churn.customerID[churn.customerID.duplicated()]


# row count matches 7043
Out[9]:
Series([], Name: customerID, dtype: object)
In [10]:
churn.columns
Out[10]:
Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')
In [11]:
churn.InternetService.value_counts() # totals 7043 
Out[11]:
Fiber optic    3096
DSL            2421
No             1526
Name: InternetService, dtype: int64
In [12]:
churn.OnlineBackup.value_counts()
Out[12]:
No                     3088
Yes                    2429
No internet service    1526
Name: OnlineBackup, dtype: int64
In [13]:
churn.OnlineSecurity.value_counts()
Out[13]:
No                     3498
Yes                    2019
No internet service    1526
Name: OnlineSecurity, dtype: int64

Task #3d: Create a new column within the DataFrame called ‘Techsavvy’. Rows within this column will have a value of either high, medium, or low. High Techsavvy customers use both online security and online backup and have a Fiber Optic internet service. Medium Techsavvy customers also use Fiber Optic internet service and online security, but not online backup. All other customers are low Techsavvy. As we cannot really tell the level of tech-savvyness for customers who do not use the internet service (i.e., are ‘no’), do not give them any ‘Techsavvy’ rating – instead, put a placeholder value of ‘unknown’ in the Techsavvy column. Note: exclude customers with an ‘unknown’ Techsavvy value from any visualizations (see below).

In [14]:
# 3d creating the new Techsavvy column

def tech_savvy(x):
    if x.InternetService == 'Fiber optic':
        if x.OnlineSecurity == 'Yes' and x.OnlineBackup == 'Yes':
            return 'High'
        elif x.OnlineSecurity == 'Yes' and x.OnlineBackup == 'No':
            return 'Medium'
        else:
            return 'Low'
    elif x.InternetService == 'No':
        return 'Unknown'
    else:  # DSL (and any other internet service) is Low
        return 'Low'


churn['Techsavvy'] = churn.apply(tech_savvy, axis=1)
In [15]:
churn.Techsavvy.value_counts() #7043 rows...all rows accounted for
Out[15]:
Low        4678
Unknown    1526
High        489
Medium      350
Name: Techsavvy, dtype: int64

Task #3e: Create a new column within the DataFrame called ‘Streamer’. Rows within this column will have a value of ‘streamer’ if they subscribe to both StreamingTV and StreamingMovies. Otherwise, a customer will have a streamer value of non-streamer. Again, preclude any customers that do not subscribe to any internet service – give these customers the value of ‘unknown’. Note: exclude customers with an ‘unknown’ Streamer’ value from any visualizations (see below).

In [16]:
# 3e creating a new column

def streamer(x):
    if x.InternetService == 'No':
        return 'Unknown'
    else:
        if x.StreamingMovies == 'Yes' and x.StreamingTV == 'Yes':
            return 'Streamer'
        else:
            return 'Non-streamer'



churn['Streamer'] = churn.apply(streamer, axis=1)
In [17]:
churn.Streamer.value_counts() 
Out[17]:
Non-streamer    3577
Streamer        1940
Unknown         1526
Name: Streamer, dtype: int64

Task #3f: Replace all occurrences of ‘No internet service’ with ‘NA’.

In [18]:
#3f recoding the 'No' InternetService value to 'NA' (map requires listing every value; unmapped values would become NaN)

churn.InternetService = churn.InternetService.map({'Fiber optic': 'Fiber optic', 'DSL': 'DSL', 'No': 'NA'})
churn.InternetService.value_counts()
Out[18]:
Fiber optic    3096
DSL            2421
NA             1526
Name: InternetService, dtype: int64

Task #3g: Recode ‘Bank transfer (automatic)’ and ‘Credit card (automatic)’ to be ‘Automatic’

In [19]:
churn.PaymentMethod.value_counts() # 3g checking for all values
Out[19]:
Electronic check             2365
Mailed check                 1612
Bank transfer (automatic)    1544
Credit card (automatic)      1522
Name: PaymentMethod, dtype: int64
In [20]:
# mapping both automatic payment methods to 'Automatic'
churn.PaymentMethod = churn.PaymentMethod.map({
    'Electronic check': 'Electronic check',
    'Mailed check': 'Mailed check',
    'Bank transfer (automatic)': 'Automatic',
    'Credit card (automatic)': 'Automatic'})
In [21]:
churn.PaymentMethod.value_counts() # Automatic does equal total of both previous credit card and bank transfer
Out[21]:
Automatic           3066
Electronic check    2365
Mailed check        1612
Name: PaymentMethod, dtype: int64

Task #3h: Create an explicit row index for your DataFrame using the customerID field. Name this index CustomerID.

In [22]:
#3h setting index

churn.set_index('customerID', inplace = True)
churn.head()
Out[22]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup ... StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn Techsavvy Streamer
customerID
7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes ... No No Month-to-month Yes Electronic check 29.85 29.85 No Low Non-streamer
5575-GNVDE Male 0 No No 34 Yes No DSL Yes No ... No No One year No Mailed check 56.95 1889.5 No Low Non-streamer
3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes ... No No Month-to-month Yes Mailed check 53.85 108.15 Yes Low Non-streamer
7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No ... No No One year No Automatic 42.30 1840.75 No Low Non-streamer
9237-HQITU Female 0 No No 2 Yes No Fiber optic No No ... No No Month-to-month Yes Electronic check 70.70 151.65 Yes Low Non-streamer

5 rows × 22 columns

Task #3i: Identify any potential outliers in the column MonthlyCharges. Plot the values to help identify any outliers. If you find potential outliers, change the monthly charge to match the ceiling amount.

In [23]:
churn.MonthlyCharges #3i 
Out[23]:
customerID
7590-VHVEG     29.85
5575-GNVDE     56.95
3668-QPYBK     53.85
7795-CFOCW     42.30
9237-HQITU     70.70
               ...  
6840-RESVB     84.80
2234-XADUH    103.20
4801-JZAZL     29.60
8361-LTMKD     74.40
3186-AJIEK    105.65
Name: MonthlyCharges, Length: 7043, dtype: float64
In [24]:
#3i plotting the distribution; no outliers identified

import seaborn as sns
import numpy as np
import plotly.express as px
sns.distplot(churn.MonthlyCharges)
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f312091348>
In [25]:
fig = px.histogram(churn, x = churn.MonthlyCharges, nbins = 50)
fig.show()
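Had outliers shown up, one common approach is to cap values at the IQR upper fence with `clip`; the task's "ceiling amount" could be read that way. A sketch on synthetic numbers (not the actual MonthlyCharges column):

```python
import pandas as pd

# Synthetic charges with one planted outlier (999.0)
charges = pd.Series([20.0, 45.0, 70.0, 90.0, 110.0, 999.0])

q1, q3 = charges.quantile(0.25), charges.quantile(0.75)
ceiling = q3 + 1.5 * (q3 - q1)  # standard IQR upper fence

# Values above the fence are pulled down to the fence
capped = charges.clip(upper=ceiling)
print(capped.max() < 999)  # True: the outlier was capped
```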

Task #3j: Perform a cross-check that tenure multiplied by MonthlyCharges equals TotalCharges. If the cross-check fails to be within a 10% range of the recorded TotalCharges, then drop the entire record.

In [26]:
# converting all columns to number data types

#3j

churn.tenure = churn.tenure.astype(int)
churn.MonthlyCharges = churn.MonthlyCharges.astype(float)
In [27]:
# error at position 488 ---- BAD DATA ---drop rows with "  " for TotalCharges

churn.TotalCharges = pd.to_numeric(churn.TotalCharges)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
pandas\_libs\lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string " "

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-27-40c8745cf0ef> in <module>
      1 # error at position 488 ---- BAD DATA ---drop rows with "  " for TotalCharges
      2 
----> 3 churn.TotalCharges = pd.to_numeric(churn.TotalCharges)

~\Anaconda3\lib\site-packages\pandas\core\tools\numeric.py in to_numeric(arg, errors, downcast)
    149             coerce_numeric = errors not in ("ignore", "raise")
    150             values = lib.maybe_convert_numeric(
--> 151                 values, set(), coerce_numeric=coerce_numeric
    152             )
    153 

pandas\_libs\lib.pyx in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string " " at position 488
In [28]:
#identifying number of rows with blank TotalCharges

churn.TotalCharges[churn.TotalCharges != " "] # row count 7032 which means there are 11 missing values
Out[28]:
customerID
7590-VHVEG      29.85
5575-GNVDE     1889.5
3668-QPYBK     108.15
7795-CFOCW    1840.75
9237-HQITU     151.65
               ...   
6840-RESVB     1990.5
2234-XADUH     7362.9
4801-JZAZL     346.45
8361-LTMKD      306.6
3186-AJIEK     6844.5
Name: TotalCharges, Length: 7032, dtype: object
In [29]:
churn.TotalCharges.dropna(inplace=True) # ineffective: the blanks are " " strings, not NaN
In [30]:
# row number matches

churn.shape # showing dropna is not working
Out[30]:
(7043, 22)
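An alternative to recoding blanks by hand: `pd.to_numeric(..., errors='coerce')` turns unparseable strings into real NaN, which `dropna` can then see. A sketch on toy values:

```python
import pandas as pd

# ' ' mimics the bad TotalCharges cells in the churn file
total = pd.Series(['29.85', '1889.5', ' '])

numeric = pd.to_numeric(total, errors='coerce')  # ' ' becomes NaN
print(numeric.isna().sum())  # 1 blank detected

clean = numeric.dropna()  # now dropna actually removes the bad row
print(len(clean))  # 2
```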
In [31]:
# recoding the " " blanks to 0 so the column can be converted to numeric

churn.loc[churn.TotalCharges == " ", 'TotalCharges'] = 0
In [32]:
churn.TotalCharges[churn.TotalCharges != " "] # row count changed to 7043 showing that the 11 missing values have been recoded to 0 
Out[32]:
customerID
7590-VHVEG      29.85
5575-GNVDE     1889.5
3668-QPYBK     108.15
7795-CFOCW    1840.75
9237-HQITU     151.65
               ...   
6840-RESVB     1990.5
2234-XADUH     7362.9
4801-JZAZL     346.45
8361-LTMKD      306.6
3186-AJIEK     6844.5
Name: TotalCharges, Length: 7043, dtype: object
In [33]:
churn.TotalCharges = pd.to_numeric(churn.TotalCharges)
In [34]:
# Create a cross_check column holding the ratio of (tenure * MonthlyCharges) to TotalCharges.
# Rows with TotalCharges of 0, or with a ratio outside the 10% band, get the string 'NaN'
# and are dropped in a later cell.


def cross_check(x):
    if x.TotalCharges == 0:
        return 'NaN'
    a = (x.tenure * x.MonthlyCharges) / x.TotalCharges
    if 0.9 < a < 1.1:
        return a
    return 'NaN'

churn['cross_check'] = churn.apply(cross_check, axis=1)
In [35]:
churn.cross_check.value_counts() # 7043 total: 614 rows at exactly 1.0, 397 flagged 'NaN'
Out[35]:
1.0                   614
NaN                   397
0.9391304347826087      2
1.0419681620839363      2
1.054054054054054       2
                     ... 
0.9957003211596848      1
1.007497776083365       1
0.9602308291223303      1
1.0321002751452157      1
0.9528394995865455      1
Name: cross_check, Length: 6031, dtype: int64
In [36]:
#drop all rows flagged 'NaN' (ratio outside the 10% range)

churn.drop(churn[churn['cross_check'] == 'NaN'].index, inplace = True)
In [37]:
# rows check out 7043 - 397 (NaN) = 6646

churn.info()
<class 'pandas.core.frame.DataFrame'>
Index: 6646 entries, 7590-VHVEG to 3186-AJIEK
Data columns (total 23 columns):
gender              6646 non-null object
SeniorCitizen       6646 non-null int64
Partner             6646 non-null object
Dependents          6646 non-null object
tenure              6646 non-null int32
PhoneService        6646 non-null object
MultipleLines       6646 non-null object
InternetService     6646 non-null object
OnlineSecurity      6646 non-null object
OnlineBackup        6646 non-null object
DeviceProtection    6646 non-null object
TechSupport         6646 non-null object
StreamingTV         6646 non-null object
StreamingMovies     6646 non-null object
Contract            6646 non-null object
PaperlessBilling    6646 non-null object
PaymentMethod       6646 non-null object
MonthlyCharges      6646 non-null float64
TotalCharges        6646 non-null float64
Churn               6646 non-null object
Techsavvy           6646 non-null object
Streamer            6646 non-null object
cross_check         6646 non-null object
dtypes: float64(2), int32(1), int64(1), object(19)
memory usage: 1.2+ MB
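The same cross-check can be done without a row-wise `apply`, using a vectorized ratio and `Series.between`. A sketch on toy rows with the notebook's column names:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'tenure': [10, 10, 0],
                    'MonthlyCharges': [50.0, 50.0, 30.0],
                    'TotalCharges': [510.0, 900.0, 0.0]})

# Ratio of expected to recorded charges; zero TotalCharges becomes NaN to avoid division by zero
ratio = (toy.tenure * toy.MonthlyCharges) / toy.TotalCharges.replace(0, np.nan)

keep = ratio.between(0.9, 1.1)  # NaN fails the check, so zero-charge rows drop too
toy = toy[keep]
print(len(toy))  # only the first row passes the 10% band
```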

4: Structure


Task #4a: Categorize the ‘tenure’ field across four intervals. Ensure the interval cutoffs provide a balanced number (or close to that) of customers within each category and choose appropriate brief labels for each category. (Hint: you can use cut/qcut). Generate a visualization to display the distribution of customers across the categories to confirm the categorization has been performed correctly. Be careful to preserve the ‘tenure’ field, as it will be needed for our Logistic Regression analysis. Place the categorized tenure field (name this field: ‘TenureCat’) immediately to the right of the tenure field within the DataFrame.

In [38]:
#4a qcut of tenure into four roughly equal-sized bins

tenure = ['Infant','Child', 'Teen', 'Adult']


churn['TenureCat'] = pd.qcut(churn.tenure, 4, labels=tenure)

Note: I have created 4 Tenure Categories named: Infant, Child, Teen, Adult

In [39]:
churn.TenureCat.value_counts()# total of rows = 6646
Out[39]:
Teen      1714
Infant    1679
Child     1646
Adult     1607
Name: TenureCat, dtype: int64
In [40]:
# reordering columns
churn.columns
Out[40]:
Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
       'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',
       'MonthlyCharges', 'TotalCharges', 'Churn', 'Techsavvy', 'Streamer',
       'cross_check', 'TenureCat'],
      dtype='object')
In [41]:
churn = churn[['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure','TenureCat',
       'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
       'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',
       'MonthlyCharges', 'TotalCharges', 'Churn', 'Techsavvy', 'Streamer',
       'cross_check']]

churn.columns
Out[41]:
Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'TenureCat', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn', 'Techsavvy',
       'Streamer', 'cross_check'],
      dtype='object')
In [42]:
churn.head()
Out[42]:
gender SeniorCitizen Partner Dependents tenure TenureCat PhoneService MultipleLines InternetService OnlineSecurity ... StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn Techsavvy Streamer cross_check
customerID
7590-VHVEG Female 0 Yes No 1 Infant No No phone service DSL No ... No Month-to-month Yes Electronic check 29.85 29.85 No Low Non-streamer 1
5575-GNVDE Male 0 No No 34 Teen Yes No DSL Yes ... No One year No Mailed check 56.95 1889.50 No Low Non-streamer 1.02477
3668-QPYBK Male 0 No No 2 Infant Yes No DSL Yes ... No Month-to-month Yes Mailed check 53.85 108.15 Yes Low Non-streamer 0.995839
7795-CFOCW Male 0 No No 45 Teen No No phone service DSL Yes ... No One year No Automatic 42.30 1840.75 No Low Non-streamer 1.03409
9237-HQITU Female 0 No No 2 Infant Yes No Fiber optic No ... No Month-to-month Yes Electronic check 70.70 151.65 Yes Low Non-streamer 0.93241

5 rows × 24 columns

In [43]:
import plotly.express as px

fig = px.histogram(churn, x = churn.TenureCat, color = churn.TenureCat)
fig.show()

Task #4b: For each of the tenure categories, provide a count of customers across the different contract types (one year, two year, etc.). Also report the average monthly charge across the different contract types. Plot this data in an appropriate visualization to promote visual comparison across the tenure categories.

In [44]:
#4b for each TenureCat provide a count of customers across the different contract types


px.histogram(churn, x = churn.TenureCat, color = churn.Contract, barmode = 'group')
In [45]:
#4b Continued: showing average MonthlyCharges by Contract and TenureCat 

u = churn.groupby(['Contract', 'TenureCat'])['MonthlyCharges'].mean()
u = u.unstack()
u
Out[45]:
TenureCat Infant Child Teen Adult
Contract
Month-to-month 59.187241 70.297854 79.303598 89.678846
One year 37.193443 51.640086 68.231575 83.620627
Two year 34.600000 34.773684 49.694125 70.551685
In [46]:
#graph to compare monthly charges across TenureCat by contract type

import plotly.graph_objects as go
import plotly.figure_factory as ff


table_data = u
fig = ff.create_table(table_data, height_constant=50)


tenure_cats = u.columns  # after unstack, the columns are the tenure categories
U1 = u.iloc[0]  # Month-to-month row
U2 = u.iloc[1]  # One year row
U3 = u.iloc[2]  # Two year row


trace1 = go.Bar(x=tenure_cats, y=U1, xaxis='x2', yaxis='y2',
                marker=dict(color='lightblue'),
                name='Month<br>to<br>Month')
trace2 = go.Bar(x=tenure_cats, y=U2, xaxis='x2', yaxis='y2',
                marker=dict(color='salmon'),
                name='One<br>Year')

trace3 = go.Bar(x=tenure_cats, y=U3, xaxis='x2', yaxis='y2',
                marker=dict(color='indianred'),
                name='Two<br>Year')


fig.add_traces([trace1, trace2, trace3])


fig['layout']['xaxis2'] = {}
fig['layout']['yaxis2'] = {}


fig.layout.yaxis.update({'domain': [0, .20]})
fig.layout.yaxis2.update({'domain': [.40, 1]})


fig.layout.yaxis2.update({'anchor': 'x2'})
fig.layout.xaxis2.update({'anchor': 'y2'})
fig.layout.yaxis2.update({'title': 'Monthly Charge in Dollars'})


fig.layout.margin.update({'t':75, 'l':50})
fig.layout.update({'title': 'Breakdown of Tenure Monthly Charge'})


fig.layout.update({'height':800})


fig.show()

7: Data Discovery and Exploration


Task #7a: Generate a countplot of Churn status.

In [47]:
#7a showing the count of each churn status

px.histogram(churn, x = churn.Churn, color = churn.Churn)

Task #7b: Generate a barplot comparing MonthlyCharges against InternetService.

In [48]:
#7b Generate a barplot comparing MonthlyCharges against InternetService
from plotly.subplots import make_subplots

# groupby to get the appropriate data
sum_IS = churn.groupby('InternetService')['MonthlyCharges'].sum()
mean_IS = churn.groupby('InternetService')['MonthlyCharges'].mean()


fig = make_subplots(rows=1, cols=2)


fig.add_bar(x = sum_IS.index, y=sum_IS,
            marker=dict(color="MediumPurple"), text = sum_IS, textposition ='outside',
            name="Total Monthly Charges", row=1, col=1)


fig.add_bar(x = mean_IS.index, y=mean_IS,
            marker=dict(color="LightSeaGreen"),
            name="Average Monthly Charges", text = mean_IS, textposition = 'outside', row=1, col=2)

fig.layout.title.text = "Monthly Charges by Internet Service"

fig.show()

Task #7c: Generate a violinplot comparing Churn status against InternetService.

In [49]:
#7c Generate a violinplot comparing MonthlyCharges against InternetService, with Churn as the hue.
# Plot the raw MonthlyCharges values so each violin shows a real distribution
# (violins drawn from the three group means alone would be degenerate).
mean_IS = churn.groupby(['InternetService', 'Churn'])['MonthlyCharges'].mean()
mean_IS = mean_IS.unstack(1)  # kept for the table displayed in the next cell


fig = go.Figure()

for status in ['Yes', 'No']:
    subset = churn[churn.Churn == status]
    fig.add_trace(go.Violin(
                            x=subset.InternetService,
                            y=subset.MonthlyCharges,
                            name='Churn = ' + status,
                            box_visible=True,
                            meanline_visible=True))

fig.update_layout(violinmode='group')

fig.layout.title.text = "Monthly Charges by Churn Status and InternetService"

fig.show()
In [50]:
mean_IS
Out[50]:
Churn No Yes
InternetService
DSL 60.698384 49.984444
Fiber optic 94.128866 88.312179
NA 21.196634 20.398515

Task #7d: Generate a series of plots comparing churn status against tenure, MonthlyCharges, TotalCharges, Techsavvy, SeniorCitizen, TenureCat, and Streamer. Based on the visualizations performed so far, provide at least 3 preliminary observations – for example, do these visualizations provide any indicators of possible good predictors of Churn status? Please record your observations within the Notebook.

In [51]:
#7d Generate a series of plots (Churn vs TotalCharges, Techsavvy, MonthlyCharges, SeniorCitizen, TenureCat, Streamer)
In [52]:
import plotly.graph_objects as go

ch_totalcharge = churn.groupby('Churn')['TotalCharges'].mean()


x = ch_totalcharge.index
y = [ch_totalcharge['No'], ch_totalcharge['Yes']]


fig = go.Figure(data=[go.Bar(
            x=x, y=y,
            text=y,
            marker_color = ['lightblue', 'salmon'],
            textposition='auto',
        )])

fig.update_layout(title_text='Churn Status vs Average Total Charge')

fig.show()
In [53]:
ch_techsavvy = churn.groupby(['Techsavvy','Churn'])['Churn'].count()
ch_techsavvy = ch_techsavvy.unstack(1)

x = ch_techsavvy.index
y_1 = ch_techsavvy.No
y_2 = ch_techsavvy.Yes


fig = go.Figure()



fig.add_trace(go.Bar(
            x=x, y=y_1,
            text=y_1,
            name = 'No Churn',
            marker_color = 'lightblue',
            textposition='auto',
        ))


fig.add_trace(go.Bar(
            x=x, y=y_2,
            text=y_2,
            name = 'Yes Churn',
            marker_color = 'salmon',
            textposition='auto',
        ))
fig.update_layout(title_text='Churn Status vs TechSavvy Status')

fig.show()
In [54]:
ch_monthlycharges = churn.groupby('Churn')['MonthlyCharges'].sum()

x = ch_monthlycharges.index
y = [ch_monthlycharges['No'], ch_monthlycharges['Yes']]


fig = go.Figure(data=[go.Bar(
            x=x, y=y,
            text=y,
            marker_color = ['lightblue', 'salmon'],
            textposition='auto',
        )])

fig.update_layout(title_text='Churn Status vs Total Monthly Charges')

fig.show()
In [55]:
## Clo suggestion: binning MonthlyCharges makes it easier to see that there is no clear churn pattern across charge levels; relabel the bins if we keep this

churn['MonthlyCat'] = pd.qcut(churn['MonthlyCharges'], q=20, labels=[str(i) for i in range(1, 21)])
In [56]:
px.histogram(churn, x = churn.MonthlyCat, color = churn.Churn, barmode = 'group')
In [57]:
churn.drop('MonthlyCat', axis = 1, inplace = True)
In [58]:
churn.SeniorCitizen = churn.SeniorCitizen.apply(lambda x: 'No' if x == 0 else 'Yes')
In [59]:
ch_seniorcitizen = churn.groupby('SeniorCitizen')['Churn'].count()


x = ch_seniorcitizen.index
y = [ch_seniorcitizen['No'], ch_seniorcitizen['Yes']]


fig = go.Figure(data=[go.Bar(
            x=x, y=y,
            text=y,
            marker_color = ['lightblue', 'salmon'],
            textposition='auto',
        )])

fig.update_layout(title_text='Churn Status vs Senior Citizen')

fig.show()
In [60]:
ch_tenurecat = churn.groupby(['TenureCat', 'Churn'])['Churn'].count()
ch_tenurecat = ch_tenurecat.unstack(1)

x = ch_tenurecat.index
y_1 = ch_tenurecat.No
y_2 = ch_tenurecat.Yes


fig = go.Figure()



fig.add_trace(go.Bar(
            x=x, y=y_1,
            text=y_1,
            name = 'No Churn',
            marker_color = 'lightblue',
            textposition='auto',
        ))


fig.add_trace(go.Bar(
            x=x, y=y_2,
            text=y_2,
            name = 'Yes Churn',
            marker_color = 'salmon',
            textposition='auto',
        ))
fig.update_layout(title_text='Churn Status vs TenureCat Status')

fig.show()
In [61]:
ch_streamer = churn.groupby(['Streamer', 'Churn'])['Churn'].count()
ch_streamer = ch_streamer.unstack(1)

x = ch_streamer.index
y_1 = ch_streamer.No
y_2 = ch_streamer.Yes


fig = go.Figure()



fig.add_trace(go.Bar(
            x=x, y=y_1,
            text=y_1,
            name = 'No Churn',
            marker_color = 'lightblue',
            textposition='auto',
        ))


fig.add_trace(go.Bar(
            x=x, y=y_2,
            text=y_2,
            name = 'Yes Churn',
            marker_color = 'salmon',
            textposition='auto',
        
        ))
fig.update_layout(title_text='Churn Status vs Streamer Status')

fig.show()
  • Fiber optic internet service accounts for almost half of all revenue, according to the graph in 7b
  • Customers who have NOT churned spend more in total than those who have abandoned the service
  • 'Medium' Techsavvy customers have a much higher churn rate (102/237 = 43%)
  • 'High' Techsavvy customers have a very low churn rate
  • Newer customers (Infants) churn at a much higher rate, most likely due to the high share of month-to-month contracts
  • Senior Citizen status also looks like a good predictor of churn

Task #7e: Calculate categorical means (i.e., means for each category) for MonthlyCharges across the InternetService, TenureCat’ and Techsavvy categories. Also, calculate categorical means for Tenure across the InternetService, Techsavvy and SeniorCitizen categories. Hint: use groupby. Plot the two most interesting results.
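The groupby pattern for these categorical means is the same for every pairing, so one loop per target column covers them all. A sketch on toy data with the notebook's column names:

```python
import pandas as pd

toy = pd.DataFrame({'InternetService': ['DSL', 'DSL', 'Fiber optic'],
                    'Techsavvy': ['Low', 'Low', 'High'],
                    'MonthlyCharges': [30.0, 50.0, 90.0]})

# One loop covers every requested grouping for a given target column
for cat in ['InternetService', 'Techsavvy']:
    means = toy.groupby(cat)['MonthlyCharges'].mean()
    print(means.to_dict())
```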

In [62]:
#7e categorical means for MonthlyCharges: InternetService, TenureCat, Techsavvy; for tenure: InternetService, Techsavvy, SeniorCitizen

mc_internetservice = churn.groupby('InternetService')['MonthlyCharges'].mean()
x = mc_internetservice.index
y = mc_internetservice

fig = go.Figure(data=[go.Bar(
            x=x, y=y,
            text=y,
            marker_color = ['lightblue', 'salmon', 'indianred', '#330C73'],
            textposition='auto',
    
        )])

fig.update_layout(title_text='MonthlyCharges vs InternetService')

fig.show()
In [63]:
churn.InternetService
Out[63]:
customerID
7590-VHVEG            DSL
5575-GNVDE            DSL
3668-QPYBK            DSL
7795-CFOCW            DSL
9237-HQITU    Fiber optic
                 ...     
6840-RESVB            DSL
2234-XADUH    Fiber optic
4801-JZAZL            DSL
8361-LTMKD    Fiber optic
3186-AJIEK    Fiber optic
Name: InternetService, Length: 6646, dtype: object
In [64]:
mc_tenurecat = churn.groupby('TenureCat')['MonthlyCharges'].mean()
x = mc_tenurecat.index
y = mc_tenurecat

fig = go.Figure(data=[go.Bar(
            x=x, y=y,
            text=y,
            marker_color = ['lightblue', 'salmon', 'indianred', '#330C73'],
            textposition='auto',
        )])

fig.update_layout(title_text='MonthlyCharges vs Tenure Category')

fig.show()
In [65]:
# note: the categoryorder setting below sorts the bars by average charge, which places Medium between High and Low

mc_techsavvy = churn.groupby('Techsavvy')['MonthlyCharges'].mean()
x = mc_techsavvy.index
y = mc_techsavvy


fig = go.Figure(data=[go.Bar(
            x=x, y=y,
            text=y,
            marker_color = ['lightblue', 'salmon', 'indianred', '#330C73'],
            textposition='auto',
            
        )])

fig.update_layout(title_text='MonthlyCharges vs Techsavvy')
fig.update_layout(xaxis={'categoryorder':'mean descending'})

fig.show()
In [66]:
tn_internetservice = churn.groupby('InternetService')['tenure'].mean()
x = tn_internetservice.index
y = tn_internetservice


fig = go.Figure(data=[go.Bar(
            x=x, y=y,
            text=y,
            marker_color = ['lightblue', 'salmon', 'indianred'],
            textposition='auto',
        )])

fig.update_layout(title_text='Average Tenure vs InternetService')

fig.show()
In [67]:
# per the task, tenure means across Techsavvy (not TenureCat, which would just echo the binning)
tn_techsavvy = churn.groupby('Techsavvy')['tenure'].mean()
x = tn_techsavvy.index
y = tn_techsavvy


fig = go.Figure(data=[go.Bar(
            x=x, y=y,
            text=y,
            marker_color = ['lightblue', 'salmon', 'indianred', '#330C73'],
            textposition='auto',
        )])

fig.update_layout(title_text='Average Tenure vs Techsavvy')

fig.show()
In [68]:
tn_seniorcitizen = churn.groupby('SeniorCitizen')['tenure'].mean()
x = tn_seniorcitizen.index
y = tn_seniorcitizen


fig = go.Figure(data=[go.Bar(
            x=x, y=y,
            text=y,
            marker_color = ['lightblue', 'salmon'],
            textposition='auto',
        )])

fig.update_layout(title_text='Average Tenure vs Senior Citizen')

fig.show()
  • The plot of MonthlyCharges against Techsavvy is interesting: monthly charges are higher for customers who are more tech savvy.

Task #7f: Generate a single kdeplot plot comparing the distributions of MonthlyCharges and (TotalCharges divided by tenure).

In [69]:
#7f   Generate a single kdeplot plot comparing the distributions of MonthlyCharges and (TotalCharges divided by tenure)



import plotly.figure_factory as ff

a = churn.MonthlyCharges
b = churn.TotalCharges/churn.tenure

colorscale = ['#7A4579', '#D56073', 'rgb(236,158,105)', (1, 1, 0.2), (0.98,0.98,0.98)]

fig = ff.create_2d_density(
    a, b, colorscale=colorscale,
    hist_color='rgb(255, 237, 222)', point_size=3
)
fig.layout.update({'height':800})
fig.layout.update({'width':1000})
fig.update_layout(title_text='Monthly Charges vs TotalCharges/Tenure')

fig.show()
In [70]:
import matplotlib.pyplot as plt 
plt.rc("font", size=14)
import seaborn as sns

plt.figure(figsize=(15,8))
ax = sns.kdeplot(a, color="lightblue", shade=True)
sns.kdeplot(b, color="salmon", shade=True)
plt.legend(['Monthly Charges', 'Total Charges over Tenure'])
plt.title('Density Plot')
ax.set(xlabel='Monthly_Charges')
plt.xlim(-10,120)
plt.show()

Task #7g: Generate a kdeplot comparing tenure and monthly charges. As these variables are clearly on different scales, you will need to first normalize the values in both variables (use the min-max normalization approach) and also remove any distorting outliers.

In [71]:
#7g Density plot comparing MonthlyCharges vs Tenure
import sklearn.preprocessing

a = churn.MonthlyCharges 
b = churn.tenure

x = sklearn.preprocessing.minmax_scale(a, feature_range=(0, 1), axis=0, copy=True)
y = sklearn.preprocessing.minmax_scale(b, feature_range=(0, 1), axis=0, copy=True)

colorscale = ['#7A4579', '#D56073', 'rgb(236,158,105)', (1, 1, 0.2), (0.98,0.98,0.98)]

fig = ff.create_2d_density(
    x, y, colorscale=colorscale,
    hist_color='rgb(255, 237, 222)', point_size=3
)
fig.layout.update({'height':800})
fig.layout.update({'width':1000})
fig.update_layout(title_text='Monthly Charges vs Tenure')

fig.show()
In [72]:
import matplotlib.pyplot as plt 
plt.rc("font", size=14)
import seaborn as sns

plt.figure(figsize=(15,8))
ax = sns.kdeplot(x, color="lightblue", shade=True)
sns.kdeplot(y, color="salmon", shade=True)
plt.legend(['Monthly Charges', 'Tenure'])
plt.title('Density Plot')
ax.set(xlabel='Normalized value')
plt.xlim(-.25,1.25)
plt.show()
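Task #7g also asks for removal of distorting outliers, which the cells above skip before scaling. A minimal sketch of an IQR-based filter (Tukey's fences, with the conventional but here hypothetical k = 1.5 multiplier) that could be applied to each variable first:

```python
import pandas as pd

def iqr_filter(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Keep only values within k*IQR of the quartiles (Tukey's fences)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s >= q1 - k * iqr) & (s <= q3 + k * iqr)]

# Synthetic example with one obvious outlier
s = pd.Series([10, 12, 11, 13, 12, 11, 500])
print(iqr_filter(s).tolist())  # the 500 is dropped
```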

Task #7h: Generate a plot investigating whether the dataset has a gender bias.

In [73]:
#7h countplot of male vs female
import plotly.express as px
px.histogram(churn, x='gender', color='gender')
  • There is no evident gender bias in the dataset: the male and female counts are roughly balanced.

Task #7i: Generate a series of box plots showing MonthlyCharges against these fields: Techsavvy, Streamer, SeniorCitizen, and InternetService.

In [74]:
px.box(churn, x='Techsavvy', y='MonthlyCharges', color = 'Techsavvy')
In [75]:
px.box(churn, x='Streamer', y='MonthlyCharges', color = 'Streamer')
In [76]:
px.box(churn, x='SeniorCitizen', y='MonthlyCharges', color='SeniorCitizen')
In [77]:
px.box(churn, x='InternetService', y='MonthlyCharges', color='InternetService')

8: Feature Selection


For our intended analytical method – binary logistic regression – the independent variables should be independent of one another: that is, the model should exhibit little or no multicollinearity. We therefore exclude the following fields from further analysis, since we have derived other fields from them: tenure, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies. Moreover, we expect TotalCharges to be highly correlated with tenure and MonthlyCharges; accordingly, we also exclude TotalCharges from our model.
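The TotalCharges concern can be spot-checked with a correlation matrix before committing to the drop. A sketch on synthetic data (hypothetical values; in the notebook one would call churn[['tenure', 'MonthlyCharges', 'TotalCharges']].corr() directly), exploiting the fact that TotalCharges is roughly tenure times MonthlyCharges:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
tenure = rng.integers(1, 72, n)
monthly = rng.uniform(20, 120, n)
# TotalCharges is approximately tenure * MonthlyCharges plus noise
total = tenure * monthly + rng.normal(0, 50, n)

df = pd.DataFrame({'tenure': tenure, 'MonthlyCharges': monthly,
                   'TotalCharges': total})
print(df.corr().round(2))  # tenure and TotalCharges correlate strongly
```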

For brevity of analysis, we will now move forward to our modeling stage with a focus on only the following fields:

  1. Dependent variable: Churn
  2. Independent variables: SeniorCitizen, Streamer, Techsavvy, TenureCat, MonthlyCharges, PaymentMethod. Go ahead and drop all other fields from the DataFrame (but keep the row index).
In [78]:
# 8 Creating new dataframe
drop_list = ['tenure', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'TotalCharges', 'Partner','Dependents','PhoneService','MultipleLines','PaperlessBilling', 'cross_check', 'Contract','gender']
churn_new = churn.drop(drop_list, axis = 1)
In [79]:
churn_new.to_csv('churn_new.csv')

9: Modeling


Task #9a: For logistic regression, categorical variables need to be dummied (converted to dummy variables). You cannot feed labelled data (e.g., low, medium, high) directly into the model, as it has no notion of the labels' order and cannot compare values across variables. To address this, we recode the categorical variables. Pandas provides a specific capability for this type of recoding – the get_dummies method. Applying it produces an indicator matrix (DataFrame), which then needs to be merged / concatenated back with the main DataFrame. Also, as several categorical fields have more than two categories, you will need to create additional indicator fields within the DataFrame to accommodate them.

Task #9b: Check to ensure all categorical columns contain only 1/0 (True/False) values. You can do this through visual inspection, using head.
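One note on get_dummies: passing drop_first=True drops one reference level per variable automatically, which avoids the perfect multicollinearity ("dummy variable trap") that otherwise surfaces later as infinite VIFs. A minimal sketch on a toy frame (hypothetical data; the real call would take churn_new):

```python
import pandas as pd

df = pd.DataFrame({'Techsavvy': ['Low', 'Medium', 'High', 'Low'],
                   'Churn': ['No', 'Yes', 'No', 'Yes']})

# drop_first=True keeps k-1 indicator columns per k-level variable,
# so the retained columns are linearly independent
dummies = pd.get_dummies(df, drop_first=True)
print(dummies.columns.tolist())
# → ['Techsavvy_Low', 'Techsavvy_Medium', 'Churn_Yes']
```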

In [80]:
#9a and 9b
churn_dummies = pd.get_dummies(churn_new)
churn_dummies = churn_dummies.drop(['Churn_No', 'SeniorCitizen_No'], axis = 1)
churn_dummies.head()
Out[80]:
MonthlyCharges SeniorCitizen_Yes TenureCat_Infant TenureCat_Child TenureCat_Teen TenureCat_Adult PaymentMethod_Automatic PaymentMethod_Electronic check PaymentMethod_Mailed check Churn_Yes Techsavvy_High Techsavvy_Low Techsavvy_Medium Techsavvy_Unknown Streamer_Non-streamer Streamer_Streamer Streamer_Unknown
customerID
7590-VHVEG 29.85 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0
5575-GNVDE 56.95 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0
3668-QPYBK 53.85 0 1 0 0 0 0 0 1 1 0 1 0 0 1 0 0
7795-CFOCW 42.30 0 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0
9237-HQITU 70.70 0 1 0 0 0 0 1 0 1 0 1 0 0 1 0 0
In [81]:
# display-only check: without assigning the result back, astype('category') does not modify churn_dummies
churn_dummies.loc[:, churn_dummies.columns != 'MonthlyCharges'].astype('category')
Out[81]:
SeniorCitizen_Yes TenureCat_Infant TenureCat_Child TenureCat_Teen TenureCat_Adult PaymentMethod_Automatic PaymentMethod_Electronic check PaymentMethod_Mailed check Churn_Yes Techsavvy_High Techsavvy_Low Techsavvy_Medium Techsavvy_Unknown Streamer_Non-streamer Streamer_Streamer Streamer_Unknown
customerID
7590-VHVEG 0 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0
5575-GNVDE 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 0
3668-QPYBK 0 1 0 0 0 0 0 1 1 0 1 0 0 1 0 0
7795-CFOCW 0 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0
9237-HQITU 0 1 0 0 0 0 1 0 1 0 1 0 0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6840-RESVB 0 0 1 0 0 0 0 1 0 0 1 0 0 0 1 0
2234-XADUH 0 0 0 0 1 1 0 0 0 0 1 0 0 0 1 0
4801-JZAZL 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 0
8361-LTMKD 1 1 0 0 0 0 0 1 1 0 1 0 0 1 0 0
3186-AJIEK 0 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0

6646 rows × 16 columns

Task #9c: With the dataset now fully prepped for modeling, perform a binary logistic regression analysis. Use the scikit-learn package: specifically, the sklearn.linear_model.LogisticRegression class. Given the technical nature of the official documentation, you may find a blog post demonstrating application of this method helpful (for example, see below). Note: as the focus of this project is not on mastering methods, the application need not reflect all best practices for model evaluation (e.g., recursive feature elimination, ROC analysis, and confusion matrix analysis).

https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8
https://scikit-learn.org/stable/index.html
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
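The cells below fit the model with statsmodels for its coefficient summary table; the sklearn.linear_model.LogisticRegression route named in the task would look roughly like this (synthetic features stand in for the churn dummies):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
# Synthetic binary target, loosely driven by the first feature
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Hold out a test set, fit, and score
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f'test accuracy: {model.score(X_test, y_test):.2f}')
```

In the notebook, X and y would be the churn_dummies splits already defined in cell 82.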

In [82]:
#9c

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from imblearn.over_sampling import SMOTE
import numpy as np

X = churn_dummies.loc[:, churn_dummies.columns != 'Churn_Yes']
y = churn_dummies.loc[:, churn_dummies.columns == 'Churn_Yes']
In [83]:
#checking variance_inflation_factors

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns
C:\Users\Peter\Anaconda3\lib\site-packages\statsmodels\stats\outliers_influence.py:185: RuntimeWarning:

divide by zero encountered in double_scalars

In [84]:
#checking variance_inflation_factors
vif
Out[84]:
VIF Factor features
0 4.460551 MonthlyCharges
1 1.067002 SeniorCitizen_Yes
2 inf TenureCat_Infant
3 inf TenureCat_Child
4 inf TenureCat_Teen
5 inf TenureCat_Adult
6 inf PaymentMethod_Automatic
7 inf PaymentMethod_Electronic check
8 inf PaymentMethod_Mailed check
9 inf Techsavvy_High
10 inf Techsavvy_Low
11 inf Techsavvy_Medium
12 inf Techsavvy_Unknown
13 inf Streamer_Non-streamer
14 inf Streamer_Streamer
15 inf Streamer_Unknown
In [85]:
X_2 = X.drop(['Techsavvy_Medium', 'Streamer_Unknown', 'TenureCat_Adult', 'PaymentMethod_Mailed check' ], axis = 1)
In [86]:
import statsmodels.api as sm





logit_model=sm.Logit(y,X_2)
result=logit_model.fit()
print(result.summary())
Optimization terminated successfully.
         Current function value: inf
         Iterations 7
                           Logit Regression Results                           
==============================================================================
Dep. Variable:              Churn_Yes   No. Observations:                 6646
Model:                          Logit   Df Residuals:                     6634
Method:                           MLE   Df Model:                           11
Date:                Mon, 09 Dec 2019   Pseudo R-squ.:                     inf
Time:                        13:55:46   Log-Likelihood:                   -inf
converged:                       True   LL-Null:                        0.0000
Covariance Type:            nonrobust   LLR p-value:                     1.000
==================================================================================================
                                     coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
MonthlyCharges                     0.0242      0.002     10.899      0.000       0.020       0.029
SeniorCitizen_Yes                  0.5572      0.082      6.819      0.000       0.397       0.717
TenureCat_Infant                   3.1145      0.130     23.919      0.000       2.859       3.370
TenureCat_Child                    1.8066      0.123     14.661      0.000       1.565       2.048
TenureCat_Teen                     1.1113      0.121      9.162      0.000       0.874       1.349
PaymentMethod_Automatic            0.0081      0.100      0.081      0.935      -0.187       0.203
PaymentMethod_Electronic check     0.6595      0.095      6.938      0.000       0.473       0.846
Techsavvy_High                    -0.3240      0.190     -1.710      0.087      -0.695       0.047
Techsavvy_Low                      0.2735      0.140      1.956      0.050      -0.001       0.548
Techsavvy_Unknown                 -5.1331      0.175    -29.357      0.000      -5.476      -4.790
Streamer_Non-streamer             -5.0534      0.264    -19.115      0.000      -5.572      -4.535
Streamer_Streamer                 -4.9674      0.292    -17.001      0.000      -5.540      -4.395
==================================================================================================
C:\Users\Peter\Anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py:1789: RuntimeWarning:

divide by zero encountered in log

C:\Users\Peter\Anaconda3\lib\site-packages\statsmodels\base\model.py:492: HessianInversionWarning:

Inverting hessian failed, no bse or cov_params available

C:\Users\Peter\Anaconda3\lib\site-packages\statsmodels\base\model.py:492: HessianInversionWarning:

Inverting hessian failed, no bse or cov_params available

10

In [87]:
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X_2.values, i) for i in range(X_2.shape[1])]
vif["features"] = X_2.columns
vif
Out[87]:
VIF Factor features
0 4.460551 MonthlyCharges
1 1.067002 SeniorCitizen_Yes
2 2.041885 TenureCat_Infant
3 1.770848 TenureCat_Child
4 1.626470 TenureCat_Teen
5 2.024099 PaymentMethod_Automatic
6 2.055850 PaymentMethod_Electronic check
7 2.328867 Techsavvy_High
8 4.972511 Techsavvy_Low
9 2.912105 Techsavvy_Unknown
10 32.820043 Streamer_Non-streamer
11 23.691759 Streamer_Streamer
In [88]:
#dropping Streamer_Non-streamer for multicollinearity
#dropping PaymentMethod_Automatic because its p-value > .05
X_3 = X_2.drop(['Streamer_Non-streamer', 'PaymentMethod_Automatic' ], axis = 1)
X_3
Out[88]:
MonthlyCharges SeniorCitizen_Yes TenureCat_Infant TenureCat_Child TenureCat_Teen PaymentMethod_Electronic check Techsavvy_High Techsavvy_Low Techsavvy_Unknown Streamer_Streamer
customerID
7590-VHVEG 29.85 0 1 0 0 1 0 1 0 0
5575-GNVDE 56.95 0 0 0 1 0 0 1 0 0
3668-QPYBK 53.85 0 1 0 0 0 0 1 0 0
7795-CFOCW 42.30 0 0 0 1 0 0 1 0 0
9237-HQITU 70.70 0 1 0 0 1 0 1 0 0
... ... ... ... ... ... ... ... ... ... ...
6840-RESVB 84.80 0 0 1 0 0 0 1 0 1
2234-XADUH 103.20 0 0 0 0 0 0 1 0 1
4801-JZAZL 29.60 0 0 1 0 1 0 1 0 0
8361-LTMKD 74.40 1 1 0 0 0 0 1 0 0
3186-AJIEK 105.65 0 0 0 0 0 0 0 0 1

6646 rows × 10 columns

In [89]:
logit_model=sm.Logit(y,X_3)
result=logit_model.fit()
print(result.summary())
Optimization terminated successfully.
         Current function value: inf
         Iterations 7
                           Logit Regression Results                           
==============================================================================
Dep. Variable:              Churn_Yes   No. Observations:                 6646
Model:                          Logit   Df Residuals:                     6636
Method:                           MLE   Df Model:                            9
Date:                Mon, 09 Dec 2019   Pseudo R-squ.:                     inf
Time:                        13:55:51   Log-Likelihood:                   -inf
converged:                       True   LL-Null:                        0.0000
Covariance Type:            nonrobust   LLR p-value:                     1.000
==================================================================================================
                                     coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
MonthlyCharges                    -0.0097      0.001     -7.359      0.000      -0.012      -0.007
SeniorCitizen_Yes                  0.5746      0.079      7.261      0.000       0.419       0.730
TenureCat_Infant                   2.0356      0.099     20.469      0.000       1.841       2.230
TenureCat_Child                    0.8691      0.098      8.878      0.000       0.677       1.061
TenureCat_Teen                     0.2951      0.099      2.970      0.003       0.100       0.490
PaymentMethod_Electronic check     0.7297      0.066     11.039      0.000       0.600       0.859
Techsavvy_High                    -1.6253      0.167     -9.748      0.000      -1.952      -1.298
Techsavvy_Low                     -1.5728      0.096    -16.412      0.000      -1.761      -1.385
Techsavvy_Unknown                 -3.4955      0.130    -26.830      0.000      -3.751      -3.240
Streamer_Streamer                  0.4232      0.084      5.052      0.000       0.259       0.587
==================================================================================================
C:\Users\Peter\Anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py:1789: RuntimeWarning:

divide by zero encountered in log

C:\Users\Peter\Anaconda3\lib\site-packages\statsmodels\base\model.py:492: HessianInversionWarning:

Inverting hessian failed, no bse or cov_params available

C:\Users\Peter\Anaconda3\lib\site-packages\statsmodels\base\model.py:492: HessianInversionWarning:

Inverting hessian failed, no bse or cov_params available

  • Newer customers ("infants" and "children") are more likely to churn
  • This corroborates two earlier Exploratory Data Analysis charts: TenureCat vs Churn, and the density plots of Tenure vs MonthlyCharges
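The logit coefficients above are log-odds, so exponentiating makes the first bullet concrete. Using the fitted TenureCat_Infant coefficient from the final summary:

```python
import math

# Coefficient for TenureCat_Infant from the final model summary above
coef_infant = 2.0356

odds_ratio = math.exp(coef_infant)
print(f'odds ratio: {odds_ratio:.2f}')
# An "infant" (newest-tenure) customer has roughly 7.7x the odds of
# churning relative to the reference tenure category, holding the
# other predictors fixed.
```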